The distribution of word length in technical Russian
Abstract
In the course of an analysis of several samples of technical Russian, undertaken as part of a study in mechanical translation, a number of statistical data reflecting the structure of these samples were compiled. One of these, the distribution of word length, is presented here as Fig. 1. The theoretical interest of this distribution arises from the possibility of using it as a basis for an operational definition of words in printed texts. If texts are considered purely as sequences of symbols, including the letters, punctuation marks, and the space, the resulting sequences are of a length which no practicable machine can manage. A study of the distribution of the number of symbols between pairs of successive symbols of certain classes would be one way to reveal structural characteristics of the text sequences potentially useful toward the definition of manageable and significant subsequences. The subsequences included between successive occurrences of letter pairs have not been investigated. Those included between successive pairs of periods, exclamation points, or question marks can be identified with the classical sentence, and finally, those included between successive pairs of punctuation marks or spaces can be identified with words. The length distribution of the latter subsequences has the desirable property, not shared by the others, of being concentrated at relatively low values of length and of having no elements exceeding a certain length (Fig. 1). Words defined in this fashion can readily be identified by a machine, and they are of limited variety, so that their listing in a dictionary is practicable. From the practical point of view, the distribution is useful in planning input and storage facilities in experimental translating equipment. The samples used were relatively small, and Fig. 1 should therefore be interpreted with great caution. The bar graph represents the distribution of a sample totalling 6,486 words.
Points are used to indicate the distributions obtained from smaller constituents of the total. The scattering is such as to indicate that samples 1, 2, and 3 differ significantly from one another in the details of their distributions. An examination of the texts indicates that these differences can safely be attributed to differing subject matter and styles. However, all distributions are bimodal, perhaps trimodal, and cut off at k = 18, where k denotes word length. The mode about k = 7 is attributable to the large number of different words used to define the particular subject of each text. The peaks at k = 1 and at k = 3 are due to a small number of very frequent "grammatical words," that is, prepositions, conjunctions, etc. The five most frequent words of length 1, 2, and 3 in the total sample are listed in Table 1. This table shows that the most frequent two-letter words are consistently less frequent than three-letter words of similar rank. One- and two-letter words are exclusively grammatical; 90% of the three-letter words are also grammatical, leaving 10% dependent on the subject matter. The words of length 4 are nearly all inflected. The fact that only very few Russian words have stems of three or fewer letters probably accounts for the valley at k = 4. Indications thus are that the modal and cut-off structures of the distributions are functions of the structure of the Russian language, while variations within these structures are characteristic of individual authors. For those who might wish to draw their own conclusions, the raw data are given in Table 2, and the sources of the samples are listed in Table 3. Letter, digram, and suffix distributions compiled from the same samples may be found in the reference.
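The operational definition described above, in which a word is the run of symbols lying between successive punctuation marks or spaces, can be illustrated with a minimal Python sketch. This is not the procedure used in the study: the function names, the exact membership of the two delimiter classes, and the sample sentence are all assumptions made for illustration, and the sample is far too short to show the modal structure reported for Fig. 1.

```python
from collections import Counter

# Hypothetical delimiter classes: the text names the classes but does not
# list their exact members, so these sets are assumptions.
SENTENCE_DELIMS = set(".!?")
WORD_DELIMS = set(" \t\n.,;:!?()[]\"'«»-")


def segments(text, delimiters):
    """Yield the non-empty runs of symbols between successive delimiters."""
    run = []
    for ch in text:
        if ch in delimiters:
            if run:
                yield "".join(run)
                run = []
        else:
            run.append(ch)
    if run:
        yield "".join(run)


def length_distribution(text, delimiters):
    """Map each segment length k to its relative frequency in the text."""
    counts = Counter(len(s) for s in segments(text, delimiters))
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


# Made-up sample sentence, not taken from the samples analysed in the paper.
sample = ("В ходе анализа нескольких образцов технического текста "
          "были получены данные.")
dist = length_distribution(sample, WORD_DELIMS)

for k, freq in dist.items():
    print(f"k = {k:2d}: {freq:.2f}")
```

Passing `SENTENCE_DELIMS` instead of `WORD_DELIMS` yields the sentence-like subsequences mentioned in the text, so the two length distributions can be compared with the same function.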
Journal: Mechanical Translation
Volume: 1
Publication date: 1954